pgvector 是 Postgres 的扩展,能在其中存储和查询向量,与 Postgres 无缝集成。它有向量存储、相似性搜索、集成 SQL 支持等核心功能,支持多种向量类型和相似性度量标准。适用于推荐系统等场景,使用前需创建应用获取数据库,可通过控制台或 SQL 语句启用扩展,还介绍了创建表、插入和查询数据、创建索引等操作及注意事项。
pgvector: Store and query vectors in Postgres. You can store vector data in Postgres along with other data and perform vector similarity searches while still taking advantage of all the powerful features Postgres has to offer.
The pgvector extension integrates seamlessly with Postgres - allowing users to utilize its capabilities within their existing database infrastructure. By integrating the power of vectorization processing into PostgreSQL, pgvector allows users to perform large-scale similarity searches, machine learning model predictions, and a variety of other operations directly in the database, as there is no need for separate data storage or complex data transfer processes.
1. Vector Storage
pgvector uses efficient binary encoding (e.g., Faiss's IVF index) to store and retrieve high-dimensional vectors, which greatly reduces storage space requirements and improves query speed.
A variety of vector types are supported, including Single-precision, Half-precision, Binary and Sparse Vectors, to meet the needs of different scenarios.
2. Similarity Search
Support multiple similarity metrics, such as L2 Distance, Inner Product, Cosine Distance, L1 Distance, so that users can choose the appropriate search method according to actual needs.
Provide indexing strategies such as HNSW and IVFFlat to optimize query performance.
3. Integrated SQL support
Fully compatible with standard PostgreSQL, users can use SQL statements to create, query and manage vector tables, thus simplifying the development process.
Provides a series of built-in functions , such as cosine similarity calculation , etc. , to facilitate direct similarity comparison in SQL queries .
4. Extensibility and Flexibility
As an open source project , pgvector allows developers to customize and extend it as needed .
Supports ACID transactions, point-in-time recovery, JOIN operations, and all other great features of PostgreSQL.
pgvector is suitable for a variety of application scenarios, including but not limited to:
Recommender systems: generating personalized recommendations in real-time by calculating the similarity of user behavior vectors inside a database.
Image recognition: store and retrieve image feature vectors to realize content-based image search.
Natural Language Processing: vector database can store a large amount of text vector data, realize fast text search and similarity matching through vector indexing technology, and build intelligent Q&A system or chatbot.
Log in to the MemFire Cloud platform and create a new application as shown below:
Once the application is created successfully, you get a Postgres database in the cloud.
1. Enable the vector extension via the console page.
(1) Go to the Database page in the Dashboard.
(2) Click Extensions in the sidebar.
(3) Search for “vector” and enable the extension.
Alternatively, extensions can be enabled/disabled by executing the following SQL statement on the SQL Editor page.
-- Example: enable the "vector" extension. create extension vector with schema extensions; -- Example: disable the "vector" extension drop extension if exists vector;
Here is a simple example showing how to use pgvector.
Creating a Vector Table
You can define vector columns when you create a new table:
CREATE TABLE documents ( id SERIAL PRIMARY KEY, content TEXT, vector_col vector(3) );
Insert vector data
For testing purposes, we insert some data:
-- 直接插入向量 INSERT INTO documents (content, vector_col) VALUES ('example content1', '[0.1, 0.2, 0.3]'); INSERT INTO documents (content, vector_col) VALUES ('example content2', '[0.3, 0.2, 0.3]'); INSERT INTO documents (content, vector_col) VALUES ('example content3', '[0.1, 0.3, 0.2]'); INSERT INTO documents (content, vector_col) VALUES ('example content4', '[0.2, 0.2, 0.3]'); INSERT INTO documents (content, vector_col) VALUES ('example content5', '[0.1, 0.1, 0.2]');
The results of the implementation are as follows:
Querying data
Some distance functions supported by pgvector:
Cosine_similarity (cosine_similarity or <=> operator): Calculates the cosine similarity between two vectors, which is usually used to measure their similarity in direction, with values ranging from -1 (not similar at all) to 1 (exactly the same direction).
Euclidean distance (L2 norm or <-> operator): computes the straight-line distance between two vectors, which is the actual distance between two points in a multidimensional space.
Manhattan distance (L1 norm or <+> operator): also known as the L1 distance, this is the sum of the distances in each dimension and is often used to represent the shortest path in a city neighborhood. Remarks: New in version 0.7.0;
inner_product (inner_product or <#> operator): computes the dot product of two vectors, which is the sum of the products of their corresponding elements, and can be used to measure the similarity of two vectors.
Example: get the distance from the vector '[0.3, 0.2, 0.3]'.
SELECT vector_col <-> '[0.3, 0.2, 0.3]' AS distance FROM documents
The results of the implementation are as follows:
Example: get the cosine similarity to the vector '[0.3, 0.2, 0.3]':
SELECT vector_col <=> '[0.3, 0.2, 0.3]' FROM documents;
The results of the implementation are as follows:
For inner products, multiply by -1 (since <#> returns a negative inner product).
SELECT (vector_col <#> '[0.3, 0.2, 0.3]') * -1 AS inner_product FROM documents;
The results of the implementation are as follows:
By default, pgvector performs an exact nearest neighbor search, providing perfect recall. You can add indexes to use approximate nearest-neighbor searches, which can be a trade-off between speed and recall. Unlike typical indexes, query results vary when you add approximate indexes.
The index types supported by pgvector are:
HNSW: has better query performance (in terms of speed-recall tradeoff) than IVFFlat, but has slower build time and uses more memory. In addition, since there is no training step like in IVFFlat, indexes can be created even if there is no data in the table.
IVFFlat: The IVFFlat index divides the vectors into lists and then searches for the list of subsets closest to the query vector. It is faster and uses less memory than HNSW construction, but is lower in terms of query performance (in terms of speed-recall tradeoff).
Next, we take the HNSW index as an example to introduce the creation of indexes to speed up query operations.
Creating an HNSW Index
Use the CREATE INDEX command with the USING clause to create an HNSW index.
CREATE INDEX documents_idx ON documents USING hnsw (vector_col);
HNSW indexing allows you to set a number of parameters to optimize the behavior of the index, such as ef_construct and M:
CREATE INDEX documents_idx ON documents USING hnsw (vector_col) WITH (M=16, ef_construct=200);
M is the number of levels in the tree, default is 16.
ef_construct is the search parameter used to construct the index, default is 64.
After creating an index, you can use it to speed up vector searches. For example, search using Euclidean distance:
SELECT * FROM documents ORDER BY vector_col <-> '[0.1, 0.2, 0.3]'::vector LIMIT 10;
Make sure that your version of PostgreSQL is compatible with the pgvector version.
Perform sufficient testing to verify the performance and effectiveness of the index before using it in a production environment.
Depending on your specific needs, you may need to use different distance metrics or index parameters.
In conclusion, pgvector is a powerful extension to the PostgreSQL database system that provides users with efficient vector storage and querying capabilities for processing large-scale, high-dimensional vector data. Whether it is machine learning, natural language processing or other fields that require efficient similarity search, pgvector can provide strong support.